36 research outputs found

    A tool set for the quick and efficient exploration of large document collections

    We are presenting a set of multilingual text analysis tools that can help analysts in any field to explore large document collections quickly in order to determine whether the documents contain information of interest, and to find the relevant text passages. The automatic tool, which currently exists as a fully functional prototype, is expected to be particularly useful when users repeatedly have to sift through large collections of documents such as those downloaded automatically from the internet. The proposed system takes a whole document collection as input. It first carries out some automatic analysis tasks (named entity recognition, geo-coding, clustering, term extraction), annotates the texts with the generated meta-information and stores the meta-information in a database. The system then generates a zoomable and hyperlinked geographic map enhanced with information on entities and terms found. When the system is used on a regular basis, it builds up a historical database that contains information on which names have been mentioned together with which other names or places, and users can query this database to retrieve information extracted in the past. Comment: 10 page
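The historical database of co-mentioned names described in this abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' system: it assumes entity recognition has already run upstream and produced a list of names per document, and the table layout, the function names (`build_cooccurrence_db`, `mentioned_with`) and the sample names are all hypothetical.

```python
import sqlite3
from itertools import combinations


def build_cooccurrence_db(annotated_docs):
    """Store every unordered pair of names found in the same document.

    annotated_docs maps a document id to the names extracted from it
    (assumed to come from an upstream named entity recognition step).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cooccurrence (doc_id TEXT, name_a TEXT, name_b TEXT)"
    )
    for doc_id, names in annotated_docs.items():
        # Deduplicate and sort so each pair is stored once, in a stable order.
        for a, b in combinations(sorted(set(names)), 2):
            conn.execute(
                "INSERT INTO cooccurrence VALUES (?, ?, ?)", (doc_id, a, b)
            )
    conn.commit()
    return conn


def mentioned_with(conn, name):
    """Return all names that appeared in a document together with `name`."""
    rows = conn.execute(
        "SELECT name_b FROM cooccurrence WHERE name_a = ? "
        "UNION SELECT name_a FROM cooccurrence WHERE name_b = ?",
        (name, name),
    )
    return sorted(row[0] for row in rows)


# Hypothetical sample data, invented purely for illustration.
docs = {
    "doc1": ["Alice Example", "Brussels"],
    "doc2": ["Alice Example", "Bob Example", "Ispra"],
}
conn = build_cooccurrence_db(docs)
print(mentioned_with(conn, "Alice Example"))
```

Storing each pair once in sorted order keeps the table small; the query then has to look at both columns, which is why it uses a `UNION`.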

    Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives

    The aim of this contribution is to reflect on the process of building the multilingual European Literary Text Collection (ELTeC) that is being created in the framework of the networking project Distant Reading for European Literary History funded by COST (European Cooperation in Science and Technology). To provide some background, we briefly introduce the basic idea of ELTeC with a focus on the overall goals and intended usage scenarios. We then describe the collection composition principles that we have derived from the usage scenarios. In our discussion of the corpus-building process, we focus on collections of novels from four different literary traditions as components of ELTeC: French, Portuguese, Romanian, and Slovenian, selected from the more than twenty collections that are currently in preparation. For each collection, we describe some of the challenges we have encountered and the solutions developed while building ELTeC. In each case, the literary tradition, the history of the language, the current state of digitization of cultural heritage, the resources available locally, and the scholars’ training level with regard to digitization and corpus building have been vastly different. How can we, in this context, hope to build comparable collections of novels that can usefully be integrated into a multilingual resource such as ELTeC and used in Distant Reading research? Based on our individual and collective experience with contributing to ELTeC, we end this contribution with some lessons learned regarding collaborative, multilingual corpus building

    A Common XML-based Framework for Syntactic Annotations

    International peer-reviewed conference with proceedings. It is widely recognized that the proliferation of annotation schemes runs counter to the need to re-use language resources, and that standards for linguistic annotation are becoming increasingly mandatory. To answer this need, we have developed a framework comprised of an abstract model for a variety of different annotation types (e.g., morpho-syntactic tagging, syntactic annotation, co-reference annotation, etc.), which can be instantiated in different ways depending on the annotator's approach and goals. In this paper we provide an overview of the framework, demonstrate its applicability to syntactic annotation, and show how it can contribute to comparative evaluation of parser output and diverse syntactic annotation schemes.

    East meets West: Producing Multilingual Resources in a European Context

    The EU concerted action TELRI has released a two-volume CD-ROM, which contains multilingual language resources, namely corpora, lexica, and tools for language engineering. This CD-ROM provides harmonised resources for unprecedented numbers and kinds of languages, mainly from non-EU countries, for which such resources still tend to be scarce. The first volume of the CD includes the aligned text of Plato’s Republic in twenty-one languages, while the second volume contains extended results of the EU MULTEXT-East project, including the aligned and tagged novel ’1984’ by George Orwell and accompanying lexica in seven languages. The paper presents the CD-ROM, the methods employed in its creation and its prospective uses.

    The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages

    We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains so that the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable to carry out all types of cross-language research, as well as to test and benchmark text analysis software across different languages (for instance for alignment, sentence splitting and term extraction). Comment: A multilingual textual resource with meta-data freely available for download at http://langtech.jrc.it/JRC-Acquis.htm
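The figure of 190 language pairs follows from counting unordered pairs among 20 languages, i.e. 20 × 19 / 2. The short sketch below checks that arithmetic; the `language_pairs` helper and the placeholder labels are assumptions made for illustration and are not the corpus's actual language codes.

```python
from itertools import combinations
from math import comb


def language_pairs(codes):
    """Return every unordered pair of distinct language labels."""
    return list(combinations(sorted(set(codes)), 2))


# Placeholder labels standing in for the 20 official languages (assumption).
codes = [f"lang{i:02d}" for i in range(1, 21)]
pairs = language_pairs(codes)

print(len(pairs))   # pairs enumerated explicitly
print(comb(20, 2))  # same count from the binomial coefficient
```

Each language added beyond the 20 raises the count further (for example, 22 languages would give `comb(22, 2)` pairs), which is consistent with the "190+" wording in the abstract.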

    Towards an international standard on feature structures representation

    International peer-reviewed conference with proceedings. This paper describes the preliminary results of a joint initiative of the TEI (Text Encoding Initiative) Consortium and the ISO Committee TC 37/SC 4 (Language Resource Management) to provide a standard for the representation and interchange of feature structures.